Fix the snapshot summary of a partial overwrite#1879
Conversation
kevinjqliu
left a comment
There was a problem hiding this comment.
i dont mind at all. Thanks for picking this up
kevinjqliu
left a comment
There was a problem hiding this comment.
Thanks for the fix @Fokko, LGTM.
I just have a few comments on whether Spark is generating the correct snapshot summary.
| @@ -467,21 +467,19 @@ def test_partitioned_table_positional_deletes_sequence_number(spark: SparkSessio | |||
| assert snapshots[2].summary == Summary( | |||
There was a problem hiding this comment.
i ran this locally. snapshot 0 has the following metadata, which is correct
{
'added-data-files': '2',
'added-files-size': '1490',
'added-records': '5',
'changed-partition-count': '2',
'total-data-files': '2',
'total-delete-files': '0',
'total-equality-deletes': '0',
'total-files-size': '1490',
'total-position-deletes': '0',
'total-records': '5'
}
but snapshot 1, DELETE op with the positional delete, has the following metadata,
{
'added-delete-files': '1',
'added-files-size': '1710',
'added-position-delete-files': '1',
'added-position-deletes': '1',
'changed-partition-count': '1',
'total-data-files': '2',
'total-delete-files': '1',
'total-equality-deletes': '0',
'total-files-size': '3200',
'total-position-deletes': '1',
'total-records': '5'
}
everything looks right except for the total-records. we started off with 5 records, and the DELETE op removed 1 record. So total-records should be 4 here.
Is this a bug in the spark snapshot summary?
There was a problem hiding this comment.
Interesting. Looking at it, snapshot summary 1 seems incorrect. Could you open up an issue on the Java side? It would be good to get some historical context around this.
There was a problem hiding this comment.
I was able to reproduce it using #1926. I'll also open an issue on the java side
There was a problem hiding this comment.
apache/iceberg#12823 opened an issue on the java side
Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
|
Thanks for the review @kevinjqliu 🙌 |
# Rationale for this change @kevinjqliu PTAL. I took the liberty of providing a fix for this since I was curious where this was coming from, hope you don't mind! I've cherry-picked your commit with the test.  Java produces: ```json { "added-data-files": "1", "added-files-size": "707", "added-records": "2", "app-id": "local-1743678304626", "changed-partition-count": "1", "deleted-data-files": "1", "deleted-records": "3", "engine-name": "spark", "engine-version": "3.5.5", "iceberg-version": "Apache Iceberg 1.8.1 (commit 9ce0fcf0af7becf25ad9fc996c3bad2afdcfd33d)", "removed-files-size": "693", "spark.app.id": "local-1743678304626", "total-data-files": "3", "total-delete-files": "0", "total-equality-deletes": "0", "total-files-size": "1993", "total-position-deletes": "0", "total-records": "4" } ``` # Are these changes tested? # Are there any user-facing changes? <!-- In the case of user-facing changes, please add the changelog label. --> --------- Co-authored-by: Kevin Liu <kevin.jq.liu@gmail.com> Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
# Rationale for this change @kevinjqliu PTAL. I took the liberty of providing a fix for this since I was curious where this was coming from, hope you don't mind! I've cherry-picked your commit with the test.  Java produces: ```json { "added-data-files": "1", "added-files-size": "707", "added-records": "2", "app-id": "local-1743678304626", "changed-partition-count": "1", "deleted-data-files": "1", "deleted-records": "3", "engine-name": "spark", "engine-version": "3.5.5", "iceberg-version": "Apache Iceberg 1.8.1 (commit 9ce0fcf0af7becf25ad9fc996c3bad2afdcfd33d)", "removed-files-size": "693", "spark.app.id": "local-1743678304626", "total-data-files": "3", "total-delete-files": "0", "total-equality-deletes": "0", "total-files-size": "1993", "total-position-deletes": "0", "total-records": "4" } ``` # Are these changes tested? # Are there any user-facing changes? <!-- In the case of user-facing changes, please add the changelog label. --> --------- Co-authored-by: Kevin Liu <kevin.jq.liu@gmail.com> Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
Rationale for this change
@kevinjqliu PTAL. I took the liberty of providing a fix for this since I was curious where this was coming from, hope you don't mind! I've cherry-picked your commit with the test.
Java produces:
{ "added-data-files": "1", "added-files-size": "707", "added-records": "2", "app-id": "local-1743678304626", "changed-partition-count": "1", "deleted-data-files": "1", "deleted-records": "3", "engine-name": "spark", "engine-version": "3.5.5", "iceberg-version": "Apache Iceberg 1.8.1 (commit 9ce0fcf0af7becf25ad9fc996c3bad2afdcfd33d)", "removed-files-size": "693", "spark.app.id": "local-1743678304626", "total-data-files": "3", "total-delete-files": "0", "total-equality-deletes": "0", "total-files-size": "1993", "total-position-deletes": "0", "total-records": "4" }Are these changes tested?
Are there any user-facing changes?